
A run context manager #484

Closed

Conversation

@christian-monch (Contributor) commented Oct 13, 2023

This PR adds a run-context manager. It is currently a draft because there are open TODOs, but I wanted to start the discussion early; any input on this PR is appreciated.

What's in it

The run-context manager starts a subprocess and returns either a result structure or a generator to the code executed in the context.

If the protocol that is used to create the context does not inherit `GeneratorProtocolMixIn`, the context manager will execute the specified subprocess and, once the subprocess exits, return the result of `protocol._prepare_result()` in the `as`-variable.

If the protocol that is used to create the context does inherit `GeneratorProtocolMixIn`, the context manager will start the execution of the process and return the resulting generator in the `as`-variable. The generator can be used to fetch output from the subprocess and to trigger protocol callbacks, e.g. `pipe_data_received`, `timeout`, etc. When the code exits the context, the context manager will execute `tuple()` on the generator to ensure that the subprocess exit is properly handled, i.e. protocol callbacks are called and the return code of the subprocess is fetched.

The context manager supports the specification of terminate- and/or kill-times. If those are specified, the context manager might send a terminate- and/or a kill-signal to the subprocess. The following happens if a terminate- and/or kill-time is specified (a usage sketch follows the list):

  • In the non-generator case, the context manager will send a terminate-signal to the process if it does not exit within the terminate-time number of timeouts (if no timeout was specified by the caller, the context manager uses 1 second). It will send a kill-signal if the process does not exit within (terminate-time + kill-time) number of timeouts.
  • In the generator case, the context manager will start the terminate/kill logic after the code has left the context. The context manager will try to exhaust the generator and will send a terminate- or kill-signal to the process if the terminate or kill condition is reached before the generator is exhausted.
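As a usage sketch (parameter names are taken from this PR's diff snippets; import paths and defaults are assumptions, not a verbatim API reference):

```python
# Sketch only: import paths are assumed from this PR's file layout
# (datalad_next/runners/run.py and datalad_next/runners/__init__.py).
from subprocess import DEVNULL

from datalad_next.runners import StdOutCaptureGeneratorProtocol
from datalad_next.runners.run import run

with run(
    cmd=['git', 'ls-files', '-z'],
    protocol_class=StdOutCaptureGeneratorProtocol,
    stdin=DEVNULL,
    timeout=1.0,        # length of one timeout interval in seconds
    terminate_time=5,   # terminate-signal after 5 timeout intervals
    kill_time=3,        # kill-signal 3 intervals after the terminate-time
) as r:
    for chunk in r:     # generator protocol: `r` yields output chunks
        ...
# on context exit, the remaining generator items are consumed via tuple()
# and, if needed, terminate-/kill-signals are sent as described above
```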

The PR comes with a number of tests that demonstrate the behavior.

Data processors

The PR includes an implementation of a generator that can be configured to read data from a base generator, perform different steps of data processing, and yield the result of the last processing step to the caller. There are currently three data processors:

  • A "decode" processor that decodes bytes into strings
  • A "linesplit" processor that splits bytes or strings at line-ends or at the file-end
  • A "jsonline" processor that converts lines into JSON-objects if possible.

Which processors are used is defined by putting them into a list. For example, the list [decode_processor, splitlines_processor] specifies that the content received from the underlying generator should first be decoded and then split into lines. The processing is "stream"-oriented, which means that data is returned as soon as possible, e.g. each individual line is returned to the caller as soon as it is read.
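For illustration, a minimal sketch of consuming such a pipeline, modeled on the `process_from()` call that appears later in this PR's diff (the import paths are inferred from the coverage report and are assumptions; the byte-chunk source stands in for a run-context generator):

```python
# Import paths assumed from this PR's file listing; not verbatim API docs.
from datalad_next.runners.data_processor_pipeline import process_from
from datalad_next.runners.data_processors import (
    decode_processor,
    splitlines_processor,
)

def chunks():
    # stand-in for the generator returned by the run-context manager;
    # note that a line may be split across chunk boundaries
    yield b'first line\nsec'
    yield b'ond line\n'

for line in process_from(
    data_source=chunks(),
    processors=[
        decode_processor('utf-8'),
        splitlines_processor(separator='\n', keep_ends=False),
    ],
):
    print(line)  # 'first line', then 'second line'
```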

Implementation remark

The run-context manager creates a subclass of the provided protocol class in order to amend it with the kill-logic, and uses this subclass to instantiate a runner. This construction lets the user pass any protocol class without the need to define yet another class, while ensuring that the class actually used by the runner supports the kill-logic.
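A minimal sketch of this construction (only the `KillWrapper` class name, the `dl_` keyword convention, and `_create_kill_wrapper` are mentioned in this PR; everything else here is illustrative):

```python
# Illustrative sketch of the subclass construction described above;
# not the PR's actual _create_kill_wrapper implementation.
def create_kill_wrapper(protocol_class):
    class KillWrapper(protocol_class):
        # `dl_`-prefixed keyword arguments are reserved for wrapper-added
        # parameters (the convention introduced later in this PR)
        def __init__(self, *args, dl_terminate_time=None, dl_kill_time=None,
                     **kwargs):
            self.dl_terminate_time = dl_terminate_time
            self.dl_kill_time = dl_kill_time
            super().__init__(*args, **kwargs)
        # ... terminate-/kill-bookkeeping callbacks would be added here ...
    return KillWrapper

# The runner is then instantiated with the wrapper instead of the
# user-provided protocol class, e.g.:
#   runner = ThreadedRunner(cmd, create_kill_wrapper(UserProtocol), ...)
```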

TODO

  • Improve documentation
  • Use the run-context manager in the relevant source code of datalad-next

Comment on lines 107 to 109
timeout: float | None = None,
terminate_time: int | None = None,
kill_time: int | None = None,
Member commented:

I think we need a precise definition in the documentation.

What is the difference between timeout and terminate_time?

Is kill_time on top of terminate_time?

What is the semantic difference that leads to one being called timeout and the other just time?

@mih (Member) commented Oct 21, 2023

I want to add a few more comments after having tried to exercise this through in a concrete use case.

It is unclear where the timeouts apply, and thereby it is unclear how they must be used.

Example: I have a process that downloads some information. This information is reported in a json-lines format. The process is a batch-mode process of some kind (an input triggers a response, continued until stdin closes).

Now there are several layers a user of this helper needs to be aware of.

  • the layer that feeds the input/request
  • the process that downloads and eventually reports a JSON-line
  • the protocol that assembles a JSON-line and decodes it, once complete

Let's assume the remote bandwidth is really low (a few bytes per second), and the information that needs to go into a single JSON-line is a few kB, so it takes minutes to assemble a response and have the "outer" protocol generator yield something.

What would the timeouts apply to? I assume time0 is defined by the input item being fed to the process. Is it now the inner layer (i.e. nothing coming down the remote connection), or the outer layer (no JSON-line-based result yielded for X seconds) that would timeout?

@christian-monch (Contributor, Author) replied:

If you use a generator-protocol, then the timeouts that aid the terminating and killing mechanism are only triggered when the context is left. I extended the documentation to clarify that. So for `stdout_batchcommand` and `annexjson_batchcommand`, the timeout-counting starts when the context is left, which means when the code in the context decides that all operations are done and exits the context.

Timeout-handling for the transmission should be done in the protocol that is used, e.g. `StdOutCaptureGeneratorProtocol`. For example, the protocol might send timeout-messages to the result generator, and the in-context code can decide what to do. In another example, if the protocol raises an exception in its timeout-callback, the batchcommand-context would be left and the terminate- and kill-timeouts would be started.
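A hypothetical sketch of that second example (the `timeout` callback name is taken from this discussion; its exact signature and semantics are assumptions):

```python
# Hypothetical: a protocol that escalates a transmission timeout into an
# exception. The exception would propagate to the in-context code, the
# context would be left, and the terminate-/kill-timeouts would start.
from datalad_next.runners import StdOutCaptureGeneratorProtocol  # path assumed

class RaisingTimeoutProtocol(StdOutCaptureGeneratorProtocol):
    def timeout(self, fd):
        raise TimeoutError(f'no data received on file descriptor {fd}')
```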

@mih (Member) replied:

I understand the first paragraph and appreciate the clarification.

I have difficulties following the reasoning in the 2nd paragraph. When you say "should be done", are you saying "it is not implemented anywhere"? Do we have any protocol implementation that behaves like this? If not, how do we know that it would work?

christian-monch force-pushed the run-context-manager branch 3 times, most recently from 1f49f8e to 446159e, on October 17, 2023
codecov bot commented Oct 17, 2023

Codecov Report

Attention: 50 lines in your changes are missing coverage. Please review.

Comparison is base (33b8f51) 92.41% compared to head (25bf3cf) 92.55%.
Report is 14 commits behind head on main.

❗ Current head 25bf3cf differs from pull request most recent head c42c588. Consider uploading reports for the commit c42c588 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #484      +/-   ##
==========================================
+ Coverage   92.41%   92.55%   +0.13%     
==========================================
  Files         126      138      +12     
  Lines        9577     9774     +197     
  Branches     1036     1086      +50     
==========================================
+ Hits         8851     9046     +195     
- Misses        704      712       +8     
+ Partials       22       16       -6     
Files Coverage Δ
datalad_next/iter_collections/gitworktree.py 100.00% <100.00%> (ø)
datalad_next/runners/batch.py 100.00% <100.00%> (ø)
datalad_next/runners/data_processor_pipeline.py 100.00% <100.00%> (ø)
datalad_next/runners/data_processors/__init__.py 100.00% <100.00%> (ø)
datalad_next/runners/data_processors/decode.py 100.00% <100.00%> (ø)
datalad_next/runners/data_processors/jsonline.py 100.00% <100.00%> (ø)
datalad_next/runners/data_processors/pattern.py 100.00% <100.00%> (ø)
datalad_next/runners/run.py 100.00% <100.00%> (ø)
datalad_next/runners/tests/test_batch.py 100.00% <100.00%> (ø)
datalad_next/runners/tests/test_data_processors.py 100.00% <100.00%> (ø)
... and 3 more

... and 10 files with indirect coverage changes


@mih (Member) commented Oct 18, 2023

I posted another round of comments. I am not sure if that is useful; I will stop until I get feedback.

FTR: At 28 commits it has reached my attention span limit. It would be useful for reviewing if the PR were rebased and the commits resembled more reviewable units (new features and their tests, adoption of new features in previously existing code). ATM the commits resemble the development history of the features themselves.

@christian-monch (Contributor, Author) replied:

> I posted another round of comments. I am not sure if that is useful; I will stop until I get feedback.
>
> FTR: At 28 commits it has reached my attention span limit. It would be useful for reviewing if the PR were rebased and the commits resembled more reviewable units (new features and their tests, adoption of new features in previously existing code). ATM the commits resemble the development history of the features themselves.

Will do that as soon as the batch code is settled.

This commit adds a run-context-manager
that can execute subprocesses and
can guarantee that a subprocess is
terminated some time after exiting
the context.

To have a guarantee that the subprocess
is terminated, the user has to set
`kill_time` and preferably also
`terminate_time`.
This commit uses the run-context-manager
and the data processing pipeline with a
pattern_processor in `url_operations/ssh.py`.
This commit uses the data-processing-pipeline
in `datalad_next/runners/tests/test_run.py`.
This commit addresses comment
datalad#484 (comment)
about cluttering the namespace.

The commit introduces a convention
that is currently only used in the
protocol class `KillWrapper`. The
single keyword argument of the
class is prefixed with `dl_`. We
should claim the namespace `dl_*`
for keyword arguments in the
protocols that we define.

This commit also fixes the return
value doc-string description of
`_create_kill_wrapper`.
Comment on lines 257 to 280
with run(
    cmd=[
        'git', 'ls-files',
        # we rely on zero-byte splitting below
        '-z',
        # otherwise take whatever is coming in
        *args,
    ],
    protocol_class=StdOutCaptureGeneratorProtocol,
    stdin=DEVNULL,
    cwd=path
) as r:
    # This code uses the data processor chain to process data. This fixes
    # a problem with the previous version of the code, where `decode` was
    # used on every data chunk that was sent to `pipe_data_received`. But
    # data is chunked up randomly and might be split in the middle of a
    # character encoding, leading to weird errors.
    yield from process_from(
        data_source=r,
        processors=[
            decode_processor('utf-8'),
            splitlines_processor(separator='\0', keep_ends=False)
        ]
    )
@mih (Member) commented:

Suggested change (the block above would become):
with run(
    cmd=[
        'git', 'ls-files',
        # we rely on zero-byte splitting below
        '-z',
        # otherwise take whatever is coming in
        *args,
    ],
    protocol_class=StdOutCaptureGeneratorProtocol,
    processors=[
        decode_processor('utf-8'),
        splitlines_processor(separator='\0', keep_ends=False)
    ],
    cwd=path,
) as r:
    yield from r

I brought it up before, but still think this usage should not be more complex than this. If `run()` must not have `processors=`, it could be renamed and the replacement could have it.

@christian-monch (Contributor, Author) commented Oct 24, 2023

Thanks. I think I misunderstood your earlier comments.

I have a commit which defines a `StdOutCapturePipelineGeneratorProtocol`. The protocol takes an optional processor list as a keyword-argument and sends the processing-pipeline results to the generator. This is transparent to the run-context-manager and achieves the same terse "ls-files"-implementation, i.e. the code in the context is just:

yield from r

I did not yet put it in because it requires the user of `run` to select the "right" protocol and to provide a processor list in the `protocol_kwargs`. That seemed complex to me. And, in contrast to your suggestion (put the pipeline processing into `run`), which should support any protocol, my approach requires support by the protocol. It would be transparent for the run-context-manager code though.

Your suggestion should in principle support any protocol. It would require different treatment of generator- and non-generator protocols if we want to apply pipelines to non-generator protocols as well. The latter would require the results of the non-generator protocols to adhere to a fixed convention. That seems to be the case: all non-generator protocol results are a dictionary with the keys stdout and stderr. If we want to support processor pipelines on non-generator protocols, the run-context manager would have to pass the values of `result_dict['stdout']` and possibly `result_dict['stderr']` through the processing pipeline. This would also require a way to specify processor lists for stdout and stderr. Alternatively, we could limit the processors-keyword validity to generator protocols and stdout only.

I am not sure what would be the best approach here. WDYT?

@christian-monch (Contributor, Author) commented Oct 25, 2023

After having worked on adding processing pipelines to non-generator protocols, I would say: "let's not support processing pipelines in non-generator protocols". The non-generator protocols are very much centered around joining individual byte-chunks in their `pipe_data_received`-method and decoding them into strings in their `_prepare_result`-method. Without overriding the `_prepare_result`-method, it is not even possible to receive undecoded byte-content from a subprocess.

To make use of arbitrary pipelines, `pipe_data_received` and `_prepare_result` have to be overridden. That would introduce another "subtree" of protocol-class definitions. Alternatively, we could introduce a "synchronized" run mode that would collect all results from a generator-protocol-based run and return them. In fact, such an approach could make the tree of non-generator-protocol definitions obsolete: any generator-protocol could be used in a non-generator way.

@christian-monch (Contributor, Author) commented:

I brought the processing protocol classes back. Using those classes, the code that executes `git ls-files` looks like this:

    with run(
            cmd=[
                'git', 'ls-files',
                # we rely on zero-byte splitting below
                '-z',
                # otherwise take whatever is coming in
                *args,
            ],
            protocol_class=StdOutCaptureProcessingGeneratorProtocol,
            stdin=DEVNULL,
            cwd=path,
            protocol_kwargs=dict(
                processors=[
                    decode_processor('utf-8'),
                    splitlines_processor(separator='\0', keep_ends=False)
                ]
            )
    ) as r:
        # This code uses the data processor chain to process data. This fixes
        # a problem with the previous version of the code, where `decode` was
        # used on every data chunk that was sent to `pipe_data_received`. But
        # data is chunked up randomly and might be split in the middle of a
        # character encoding, leading to weird errors.
        yield from r

Comment on lines 1 to 7
""" This module contains data processors for the data pipeline processor

The data processors contained here are:

- decode_utf8_processor

"""
@mih (Member) commented:

None of this documentation is accessible in the rendered docs. This hides the fact that all the data_processors modules are documented in the exact same way, making them indistinguishable from each other.

I think the pattern of manually listing the main components of a module in the top-level docstring is not optimal. In the rendered documentation this would be automatic/obvious already. Here it adds content that needs to be kept in sync. I believe an `__all__` declaration would be a more appropriate alternative.

In general I think that the sorting of code in all the modules artificially creates the need to describe what matters most. The "public" component tends to be last in the file (although it could be first).

@christian-monch (Contributor, Author) replied:

> [...]
> In general I think that the sorting of code in all the modules artificially creates the need to describe what matters most. The "public" component tends to be last in the file (although it could be first).

An old C/C++ habit that is seemingly difficult to shake. But I will manage :-)

This commit adds pipeline processing generator
protocols. The protocols can be initialized
with a list of processors that they will apply
to the data that they receive.

There is currently no equivalent for
non-generator protocols. The reason for that
is that those protocols expect every data
element that they handle to be `bytes`.
That does not fit well with processors
like `splitlines_processor` or
`decode_processor`.

We could implement
a new non-generator protocol "family"
that collects whatever comes from a
processor pipeline as the values of
`stdout`- and `stderr`-keys of the
result dictionary. But it is currently
not clear when and where that would
be used.
@mih (Member) commented Oct 25, 2023

I want to take one aspect from above and bring it up for a dedicated discussion:

> I would say: "let's not support processing pipelines in non-generator protocols"

What about "let's not support non-generator protocols" instead?

This PR adds 2k+ lines already. This is ~10% of the entire code base. It is a LOT. I get the impression that some of that is just due to "need to support old stuff that happened to be done". But this is not what this extension is about.

If we can find a paradigm that can be supported with a lean implementation that does not multiply the code base, I would very much prefer that. I do not mind if that has consequences, as long as they are spelled out.

What would happen if only generator-protocols were supported? What would be the performance impact on real use cases (I am specifically not talking about theory -- I understand that thread-management adds complexity)?

@christian-monch (Contributor, Author) replied:

> I want to take one aspect from above and bring it up for a dedicated discussion:
>
> > I would say: "let's not support processing pipelines in non-generator protocols"
>
> What about "let's not support non-generator protocols" instead?

That is a good question. I think it is a very interesting idea (I had a similar hunch when I wrote the developer handbook). Off the top of my head, a probably incomplete initial answer:

> This PR adds 2k+ lines already. This is ~10% of the entire code base. It is a LOT. I get the impression that some of that is just due to "need to support old stuff that happened to be done". But this is not what this extension is about.
>
> If we can find a paradigm that can be supported with a lean implementation that does not multiply the code base, I would very much prefer that. I do not mind if that has consequences, as long as they are spelled out.
> [...]

If we only support generator-protocols, we might get away with exactly one protocol that just forwards all events to a result generator. Check the GenericGeneratorProtocol example in the "Runner with Generators Protocols - Developer handbook" (https://datalad--476.org.readthedocs.build/projects/next/en/476/developer_guide/generator-runner.html) for an example. That approach moves almost all processing out of protocol callbacks into the code that uses the runner.

I think it is conceptually simpler because there are no more "opaque" operations performed in protocol classes. The drawback is that it would require new abstractions to write nice, correct, and re-usable code that processes the events in the `for event in event_source:` loop. The data processing pipeline is one concept that could be useful here. So the code base would probably not be reduced in total.
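A hypothetical illustration of such an event loop (the `(source, payload)` event shape is an assumption for illustration, not the interface from the linked handbook):

```python
# Hypothetical consumer of a generic event-forwarding generator protocol.
def collect_stdout(event_source):
    chunks = []
    for source, payload in event_source:  # event shape is assumed
        if source == 'stdout':
            chunks.append(payload)
        elif source == 'timeout':
            break  # stop waiting for the subprocess
    return chunks
```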

The sweet spot is probably somewhere in the middle. The `*ProcessingGeneratorProtocol` classes might offer a combination of abilities and flexibility that could a) reduce the overall number of classes and b) support a large range of usage scenarios.

> [...] What would happen, if only generator-protocols are supported? [...]

Within datalad-next:
  1. A reduced number of protocol classes.
  2. Possibly shorter protocol class names due to the removal of the Generator-part.
  3. If we still want to support a synchronous mode, i.e. a mode that behaves as if there were a non-generator-protocol, we could use a wrapper around asynchronous runs; see the sketch after this list. The wrapper would collect outputs from the generator until the process terminates and return those outputs together with the exit code to its caller. (A generic wrapper would either require that all generator-protocols signal data sources, i.e. stdout or stderr, in the same way, or that a provided function is capable of determining the data source.)
Within datalad:
  1. A number of runner invocations would have to be changed, to use a wrapper function as described above.
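A minimal sketch of the wrapper described in item 3 above (the function name and the `classify` callable are hypothetical; the `return_code` attribute on the exhausted generator is an assumption):

```python
# Hypothetical synchronous wrapper around a generator-protocol run.
def run_sync(result_generator, classify):
    """Collect all outputs from `result_generator` and return them
    together with the exit code.

    `classify(item)` must map a yielded item to 'stdout' or 'stderr'.
    """
    collected = {'stdout': [], 'stderr': []}
    for item in result_generator:
        collected[classify(item)].append(item)
    # exhausting the generator guarantees that the subprocess has exited;
    # assume the generator exposes the return code afterwards
    return collected, result_generator.return_code
```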
Overview of protocols that are currently used in datalad-next:

datalad-next currently uses the following protocols in its application code, i.e. not in the tests:

  • NoCaptureGeneratorProtocol
  • StdOutCaptureGeneratorProtocol
  • StdOutCaptureProcessingGeneratorProtocol
  • GeneratorAnnexJsonProtocol (from datalad-core)

The tests use the following protocols in addition:

  • StdOutErrCaptureProcessingGeneratorProtocol
  • NoCapture (from datalad-core)
  • StdOutCapture (from datalad-core)
  • StdErrCapture (from datalad-core)
  • StdOutErrCapture (from datalad-core)

> What would be the performance impact on real use cases (I am specifically not talking about theory -- I understand that thread-management adds complexity)?

The performance impact should be negligible because the non-generator protocols and the generator protocols use the same thread-based implementation. Compared to non-generator protocols the generator protocols add the overhead of:

  1. sending results to the result generator queue
  2. going through the result generator state machine, when the next result is fetched

which does not create a lot of performance overhead.

As a remark: if performance becomes a problem, we could look into a `select`- or `poll`-based runner on POSIX, which would expose the same interface as `ThreadedRunner` and be a drop-in replacement that could automatically be used in POSIX environments.

christian-monch added a commit to christian-monch/datalad-next that referenced this pull request Oct 26, 2023
This commit addresses comment:
datalad#484 (comment)
It removes unrendered doc strings and
reorders data processor code so that
the data processor is on top of the
source files. In addition, it adds
`__all__` variables to limit imports
to the objects that constitute the
user-interface.

The commit also adds doc-string
rendering for the module
`datalad_next.runners.data_processors`
This includes markup fixes
mih mentioned this pull request Oct 26, 2023
@mih (Member) commented Oct 26, 2023

Registering #497 (comment) here.

I did not expect that, but I have to admit that I lost track a bit.

@christian-monch (Contributor, Author) commented:

This PR is superseded by PR #516. I will close it for now.

christian-monch deleted the run-context-manager branch on July 16, 2024